library(plotly)
library(data.table)
library(tidyr)
library(knitr)
library(heatmaply)

Preprocessing

  • Load the data file
  • Assign the first column as row names
  • Clean the row names
  • Rename genres for better readability:
    • “Religion, Spirituality & New Age” to “Religion”
    • “Science.fiction” to “SciFi”
    • “Action.and.Adventure” to “Action”
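
The renaming step could be sketched as follows. The toy matrix `books`, the lookup vector `short_names`, and the helper `rename_genres` are illustrative assumptions; the real data file and the exact cleaned names are not shown here.

```r
# Toy stand-in for the loaded co-purchase matrix (values are made up).
long_names <- c("Religion, Spirituality & New Age", "Science.fiction",
                "Action.and.Adventure", "Drama")
books <- matrix(0, nrow = 4, ncol = 4,
                dimnames = list(long_names, long_names))

# Hypothetical lookup table: long genre name -> short genre name.
short_names <- c("Religion, Spirituality & New Age" = "Religion",
                 "Science.fiction"                  = "SciFi",
                 "Action.and.Adventure"             = "Action")

# Replace every name found in the lookup; leave all other names untouched.
rename_genres <- function(x, lookup) {
  hit <- x %in% names(lookup)
  x[hit] <- unname(lookup[x[hit]])
  x
}

rownames(books) <- rename_genres(rownames(books), short_names)
colnames(books) <- rename_genres(colnames(books), short_names)
```

Applying the same helper to row and column names keeps the two in sync, which matters for the symmetry check later on.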

All genres:

##  [1] "Satire"        "SciFi"         "Drama"         "Action"       
##  [5] "Romance"       "Mystery"       "Horror"        "Self.help"    
##  [9] "Health"        "Guide"         "Travel"        "Children.s"   
## [13] "Religion"      "Science"       "History"       "Math"         
## [17] "Anthology"     "Poetry"        "Encyclopedias" "Dictionaries" 
## [21] "Comics"        "Art"           "Cookbooks"     "Diaries"      
## [25] "Journals"
  • Check whether the upper and lower triangles are identical
## [1] TRUE
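
One way to run such a check, shown on a made-up 3 x 3 matrix `m` (an assumption; the real genre matrix is not reproduced here), is to compare the upper triangle with the matching entries of the transposed matrix:

```r
# Toy symmetric co-purchase matrix (values are made up).
m <- matrix(c(10, 2, 3,
               2, 8, 1,
               3, 1, 6),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Entries above the diagonal, in column-major order.
upper <- m[upper.tri(m)]
# The mirrored entries below the diagonal, in the same order.
lower <- t(m)[upper.tri(m)]

triangles_identical <- all(upper == lower)
print(triangles_identical)
```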
  • Transform into a long, tidy data.table
head(books_dt)
##     genreA genreB customers
## 1:  Satire Satire      3798
## 2:   SciFi Satire       423
## 3:   Drama Satire        19
## 4:  Action Satire       343
## 5: Romance Satire       505
## 6: Mystery Satire       227
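
A minimal sketch of the wide-to-long step, assuming the matrix is held in a variable `m` with genre names as row and column names (the toy values below are made up, and this may differ from the approach actually used):

```r
library(data.table)

# Toy symmetric co-purchase matrix (values are made up).
m <- matrix(c(10, 2, 3,
               2, 8, 1,
               3, 1, 6),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# as.vector() walks the matrix column by column, so the row name cycles
# fastest (genreA) while the column name is held fixed (genreB).
books_dt <- data.table(
  genreA    = rep(rownames(m), times = ncol(m)),
  genreB    = rep(colnames(m), each  = nrow(m)),
  customers = as.vector(m)
)

head(books_dt)
```

Each matrix cell becomes one row, so a 25 x 25 genre matrix yields 625 rows, including the diagonal entries where genreA equals genreB.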
  • Average number of genres per customer
## [1] 2.332187

First ideas

Show me everything!

  • Romance, SciFi, Action, and History are the most-bought genres and form a bought-together cluster
  • Dictionaries and Comics
  • Math and Poetry
  • Mystery is an outlier

Most bought genre

Best pairs

Special genres

  • Look for customers that buy only one genre
    • Compare the column sum with 2*diagonal value (colSum = diagonal + all off-diagonal entries, so 2*diagonal - colSum = diagonal - off-diagonal entries)
    • Generate a table with {genreA=<genre>, genreB=NA, customers={2*diagonal-colSum}}
  • If a customer buys more than 2 genres, they are recorded in more than 1 off-diagonal entry, so (2*diagonal - colSum) can become negative
  • If a genre is bought more often alone than in triplets (or larger combinations), (2*diagonal - colSum) > 0
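
The single-genre estimate can be illustrated on a small made-up example. The four customer baskets, the incidence matrix `X`, and the result name `single_dt` are assumptions for illustration only; the co-purchase matrix is rebuilt from the baskets so that the true answer is known.

```r
library(data.table)

# Hypothetical baskets: rows are customers, columns are genres (1 = bought).
# c1 and c2 buy only A, c3 buys A and B, c4 buys A, B and C.
X <- matrix(c(1, 0, 0,
              1, 0, 0,
              1, 1, 0,
              1, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("c", 1:4), c("A", "B", "C")))

# Co-purchase matrix: diagonal = buyers of a genre,
# off-diagonal = buyers of both genres.
m <- crossprod(X)

# 2*diagonal - colSum = diagonal minus the sum of the off-diagonal entries.
single_est <- 2 * diag(m) - colSums(m)

single_dt <- data.table(genreA    = names(single_est),
                        genreB    = NA_character_,
                        customers = unname(single_est))
print(single_dt)
```

In this toy data two customers buy only A, yet the estimate for A is lower because c4 is counted in two off-diagonal entries; B and C come out negative for the same reason.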